[VL] Adding configurations on max write file size #11606
zhouyuan wants to merge 2 commits into apache:main
Conversation
Signed-off-by: Yuan <yuanzhou@apache.org>
Force-pushed from babb80d to 1b8fe5d
```scala
.createWithDefault(10000)

val MAX_TARGET_FILE_SIZE_SESSION =
  buildConf("spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession")
```
What does Session mean here?
docs/velox-configuration.md (Outdated)
| Configuration | Default | Description |
| --- | --- | --- |
| spark.gluten.sql.columnar.backend.velox.maxSpillFileSize | 1GB | The maximum size of a single spill file created |
| spark.gluten.sql.columnar.backend.velox.maxSpillLevel | 4 | The max allowed spilling level, with zero being the initial spilling level |
| spark.gluten.sql.columnar.backend.velox.maxSpillRunRows | 3M | The maximum number of rows in a single spill run |
| spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession | 0B | The target size for each output file when writing data. 0 means no limit on target file size; the actual file size is then determined by other factors such as the max partition number and shuffle batch size. |
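For illustration, the new option could be set when launching a Gluten-enabled Spark session like any other backend config. This is a sketch, not from the PR: the config name follows the PR's current spelling (a review comment below asks to drop the Session suffix), and the 256MB value is an arbitrary example.

```shell
spark-sql \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession=256MB
```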
Does it map to Iceberg's write.target-file-size-bytes? And does it honor spark.sql.iceberg.advisory-partition-size? If so, let's honor this config in Gluten as well.
If it only takes effect on Iceberg, we may just reuse Iceberg's config instead of adding a new one.
As of today Velox uses kMaxTargetFileSize to control the Parquet write file size, so it impacts all Parquet writes. In the current Iceberg write code path, this parameter is picked up to control the Parquet file size within each partition.
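To make the semantics concrete: a max-target-file-size cap typically drives file rolling in a writer, with 0 meaning "no limit". The sketch below is illustrative only; all names are hypothetical and none of this is Velox's actual API.

```scala
// Hypothetical sketch of target-file-size-driven file rolling.
// Not Velox code: names and structure are illustrative assumptions.
object FileRollingSketch {
  final case class WrittenFile(bytes: Long)

  // Split a stream of batch sizes into output files, starting a new file
  // once the current one would exceed the target size. A target of 0 (or
  // below) means no limit: everything lands in a single file.
  def rollFiles(batchBytes: Seq[Long], maxTargetFileSize: Long): Seq[WrittenFile] = {
    if (maxTargetFileSize <= 0) return Seq(WrittenFile(batchBytes.sum))
    val files = scala.collection.mutable.ArrayBuffer.empty[Long]
    var current = 0L
    for (b <- batchBytes) {
      // Roll to a new file when appending this batch would exceed the target.
      if (current > 0 && current + b > maxTargetFileSize) {
        files += current
        current = 0L
      }
      current += b
    }
    if (current > 0) files += current
    files.map(WrittenFile).toSeq
  }
}
```

With a 100-byte target, three 60-byte batches land in three files; with target 0, they land in one 180-byte file.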
In Iceberg Java, the logic for partition control is:
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java#L695-L702

```java
.option(SparkWriteOptions.ADVISORY_PARTITION_SIZE)
.sessionConf(SparkSQLProperties.ADVISORY_PARTITION_SIZE)
.tableProperty(TableProperties.SPARK_WRITE_ADVISORY_PARTITION_SIZE_BYTES)
.defaultValue(defaultValue)
```

Precedence: write options > session conf > table property.
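That precedence chain can be sketched as plain Scala. This is only an illustration of the resolution order the Iceberg builder encodes; the names below are hypothetical, not Iceberg's actual API.

```scala
// Sketch of the resolution order: write options > session conf >
// table property > default. Names are illustrative assumptions.
object ConfResolution {
  def resolveLong(
      writeOptions: Map[String, String],
      sessionConf: Map[String, String],
      tableProps: Map[String, String],
      key: String,
      default: Long): Long =
    writeOptions.get(key)            // highest priority: per-write options
      .orElse(sessionConf.get(key))  // then the Spark session conf
      .orElse(tableProps.get(key))   // then the table property
      .map(_.toLong)
      .getOrElse(default)            // finally the built-in default
}
```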
For Iceberg, this config should be read from SparkWrite, the same way the codec is: https://github.com/apache/iceberg/blob/main/spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L138
```scala
.createWithDefault(10000)

val MAX_TARGET_FILE_SIZE_SESSION =
  buildConf("spark.gluten.sql.columnar.backend.velox.maxTargetFileSizeSession")
```
Please remove the Session suffix; it is the config type suffix used in the Velox code, not part of the config name itself.
Signed-off-by: Yuan <yuanzhou@apache.org>
What changes are proposed in this pull request?
Adds a config for the max write file size in Velox.
How was this patch tested?
Passes GHA and Velox unit tests.
Was this patch authored or co-authored using generative AI tooling?